Introduction

The aim for this analysis was to investigate the molecular heterogenity in PDX (Patient-Derived-Xenografts) models, which are human tumor grafts transplanted in murine models in order to study their molecular evolution.

The project particularly focuses of the molecular signature of the ribosomal proteins (RP). Even though ribosomes are fundamental components for protein synthesis as they are housekeeping genes, recent studies hypothesize that theyr expression end chromatin accessibility can vary significantly among patients and tumor types also.

Through scRNA-seq (levels of expression) and scATAC-seq (to assess chromatin accessibility), the aim was to verify whether ribosomal genes alone could be sufficient to distinguish different PDX samples, revealing whether is true or not that they could be used to differentiate patients.

Data description

The dataset used for the analysis comes from GEO (NCBI) with GSE307240 code: all-together they are composed of five different PDX samples, and for each can be found a scRNA-seq (.h5) file and a scATAC-seq (.tsv.gz) file. In order to be able to compute the analysis inside a Docker container all the files were first read and then subsampled, extrapolating the data for only 500 cells for each patients: this allowed to analysis to have a balanced dataset and representative for 2500 cells

Results

In this section are presented the results of the integration of the five PDX samples, focusing exclusively on RP genes.

Uploading, subsampling and sparse matrix creation of the scRNA-seq dataset

The initial phase of the analysis involved managing raw scRNA-seq datasets from five distinct osteosarcoma Patient-Derived Xenograft (PDX) models. Due to the high density of the original data and the need to circumvent hardware limitations related to available RAM, a strategic subsampling approach was implemented.

The workflow consisted of the following key steps: (1) System paths were established to load counts from .h5 (Hierarchical Data Format) files, the standard output for 10x Genomics pipelines; (2) to ensure an equitable representation of each PDX model while drastically reducing the computational footprint, 500 cells were randomly selected from each patient. This allowed the analysis to focus on the ribosomal protein (RP) signature without exhausting system resources; (3) individual Seurat objects were merged into a single integrated dataset. To maximize efficiency, the data were stored as sparse matrices, a data structure that only records non-zero values, essential for scRNA-seq.

p1_path <- "../data/RNA/h5/GSM9219737_PDX_OS107_filtered_feature_bc_matrix.h5"
p2_path <- "../data/RNA/h5/GSM9219739_PDX_OS526_filtered_feature_bc_matrix.h5"
p3_path <- "../data/RNA/h5/GSM9219741_PDX_OS774_filtered_feature_bc_matrix.h5"
p4_path <- "../data/RNA/h5/GSM9219743_PDX_OS833_filtered_feature_bc_matrix.h5"
p5_path <- "../data/RNA/h5/GSM9219745_PDX_384_atac_filtered_feature_bc_matrix.h5"

p1_data <- Read10X_h5(p1_path)
p2_data <- Read10X_h5(p2_path)
p3_data <- Read10X_h5(p3_path)
p4_data <- Read10X_h5(p4_path)
p5_data <- Read10X_h5(p5_path)

p1_obj <- prepare_pdx_rna(p1_path, "P1", n_cells = 500)
p2_obj <- prepare_pdx_rna(p2_path, "P2", n_cells = 500)
p3_obj <- prepare_pdx_rna(p3_path, "P3", n_cells = 500)
p4_obj <- prepare_pdx_rna(p4_path, "P4", n_cells = 500)
p5_obj <- prepare_pdx_rna(p5_path, "P5", n_cells = 500)

path_rna <- "../data/RNA/"

saveRDS(p1_obj, paste0(path_rna, "p1_OS107_500cells.rds"))
saveRDS(p2_obj, paste0(path_rna, "p2_OS526_500cells.rds"))
saveRDS(p3_obj, paste0(path_rna, "p3_OS774_500cells.rds"))
saveRDS(p4_obj, paste0(path_rna, "p4_OS833_500cells.rds"))
saveRDS(p5_obj, paste0(path_rna, "p5_384_500cells.rds"))

merged_5pazienti_RNA <- merge(p1_OS107, y = c(p2_OS526, p3_OS774, p4_OS833, p5_384), 
                            add.cell.ids = c("P1", "P2", "P3", "P4", "P5"), 
                            project = "PDX_Integrated")

saveRDS(merged_5pazienti_RNA, "../data/RNA/merged_5pazienti_RNA.rds")

esporta_matrice_10x(merged_5pazienti_RNA)

scRNA-seq: RP gene expression analysis

Ribosomal Protein selection and Data Clustering

Once the integrated sparse matrix was generated, the analysis was narrowed down to specifically target the expression profiles of human Ribosomal Protein (RP) genes. This targeted approach allows for the identification of patient-specific translational signatures within the PDX models.

To perform filtering and clustering the human genome annotation file (Ensembl release 115, GRCh38) has been used to identify and isolate specific genomic features. The GTF was edited to retain only RP genes and associated pseudogenes, specifically targeting those with the “RP” prefix in their nomenclature.

Using the custom-filtered annotation, the expression matrix was subsetted to include only the selected ribosomal features.

The resulting object underwent a standard scRNA-seq pipeline:

An Elbow Plot (Fig. 1a) was generated to visualize the variance explained by each Principal Component (PC). This allowed for an informed selection of the optimal number of dimensions for downstream analysis.

After selecting the most significant PCs, a UMAP projection (Fig. 1b) was created. The visualization demonstrates a clear patient-specific signature, where cells cluster primarily by their PDX model of origin based solely on the expression of ribosomal protein genes.

risultato_pdx <- readRDS("../data/RNA/merged_5pazienti_RNA.rds")
path_gtf <- "../data/RNA/Homo_sapiens.GRCh38.115.gtf.gz"
risultato_pdx <- rimuovi_geni_topo(risultato_pdx)
solo_rp <- filtra_ribosomiali(risultato_pdx, path_gtf)

risultato_pdx <- analizza_rna_ribo(solo_rp)

stile_titolo <- theme(plot.title = element_text(size = 10, face = "bold", hjust = 0.5))

p1 <- Seurat::ElbowPlot(risultato_pdx, ndims = 20) + 
      ggtitle("1a. PCA Elbow Plot (Selection of Dimensions)") + stile_titolo

p2 <- Seurat::DimPlot(risultato_pdx, reduction = "umap", group.by = "orig.ident") +
      ggtitle("1b. Patient Specific Signature (RP Genes)") + stile_titolo

(p1 | p2)

Parameter sweep (Resolution and Dimensionality)

To evaluate whether the result obtained previously was actually due to differences in RPs, a sweep of multiple clustering settings has been performed, analyzing neighbors, dimensionality (the number of PCs) and resolution (clustering based on cells granularity).

  • For dimensionality control three graphs have been produced: one with 2 dimensions, one with 10 dimensions and one with 50 dimensions.

    The 2PC dimensionality reduction is not sufficient to correctly separate biological profiles; while the 50PC plot in figure 3b, when allowing 50 dimentions, causes technical noise that fragments the clusters. The optimal dimensional reduction was obtained with a dimensionality of 10PC.

  • For the resolution sweep the of 0.1, 0.2 and 1.2 resolutions were tested: the 0.1 one allows the identification of macro-differences between patients but is not enough to distinguish all five PDXs, while a 1.2 resolution cause over-clustering which is not supported by real biological differences. The 0.2 resolution represent the perfect equilibrium that allows for the discrimination of all five patient-profiles without fragmenting the clusters.

stile_titolo <- theme(plot.title = element_text(size = 10, face = "bold", hjust = 0.5))

# SWEEP DELLA RISOLUZIONE
res_base <- FindNeighbors(risultato_pdx, dims = 1:10, verbose = FALSE)
p_res_low <- DimPlot(FindClusters(res_base, resolution = 0.1, verbose = FALSE), group.by = "seurat_clusters") + 
             ggtitle("2a. Resolution: 0.1\n(under-clustering)") + stile_titolo
p_res_mid <- DimPlot(FindClusters(res_base, resolution = 0.2, verbose = FALSE), group.by = "seurat_clusters") + 
             ggtitle("2b. Resolution: 0.2\n(optimal)") + stile_titolo
p_res_high <- DimPlot(FindClusters(res_base, resolution = 1.2, verbose = FALSE), group.by = "seurat_clusters") + 
              ggtitle("2c. Resolution: 1.2\n(over-clustering)") + stile_titolo


# SWEEP DELLE DIMENSIONI (PC)
p_2pc <- DimPlot(RunUMAP(risultato_pdx, dims = 1:2, verbose = FALSE), group.by = "orig.ident") + 
         ggtitle("2d. Dimensionality: 2 PC") + stile_titolo
p_10pc <- DimPlot(RunUMAP(risultato_pdx, dims = 1:10, verbose = FALSE), group.by = "orig.ident") + 
          ggtitle("2e. Dimensionality: 10 PC") + stile_titolo
p_20pc <- DimPlot(RunUMAP(risultato_pdx, dims = 1:20, verbose = FALSE), group.by = "orig.ident") + 
          ggtitle("2f. Dimensionality: 20 PC") + stile_titolo

p_2pc <- p_2pc + theme(legend.position = "bottom")
p_10pc <- p_10pc + theme(legend.position = "none")
p_20pc <- p_20pc + theme(legend.position = "none")
  
(p_res_low | p_res_mid | p_res_high) / (p_2pc | p_10pc | p_20pc)

This parameter sweep shows that the optimal configuration for RP gene analysis has a PC of 10 and resolution of 0.1. This setup allows to balance biological differences avoiding technical noise cause by an excessive number of dimensions or due to an artificial fragmentation of the clusters.

Pre-processing of scATAC-seq data

GenomicRanges creation and Signac object creation

The chromatin accessibility workflow utilized the Signac framework. Before going to the scATAC-seq data, it was necessary to create a GenomicRanges (GRanges) object: this preparative passage allows for the transformation of textual coordinates in computational objects, and specifically in this analysis they were needed to define the specific genomic coordinates of interest for the analysis of scATAC-seq. This was achieved again by using the already filtered human GTF (Ensembl release 115) file to retain only ribosomal protein genes and pseudogenes.

Technically speaking, GRanges objects are useful for their ability to manage genomic complexity because they not only memorize a sequence, but they also integrate essential metadata such as the chromosome of origin and genetic annotations.

Without GRanges, operations like genomic intervals intersections and calculating the distances between peaks will be prone to errors. GRanges solve this issue by allowing very fast overlapping and query, granting spatial coherence of the data, making the integration of different layers, such as accessibility loci of the chromatin and the correspondent RNA expression, easier.

ribo_granges <- crea_granges_ribo(solo_rp)

print(ribo_granges)
## GRanges object with 141 ranges and 22 metadata columns:
##           seqnames              ranges strand |         source     type
##              <Rle>           <IRanges>  <Rle> |       <factor> <factor>
##     RPL22        1     6185020-6209389      - | ensembl_havana     gene
##     RPL11        1   23691685-23696835      + | ensembl_havana     gene
##   RPS6KA1        1   26529756-26575030      + | ensembl_havana     gene
##      RPA2        1   27891524-27914746      - | ensembl_havana     gene
##      RPS8        1   44775251-44778779      + | ensembl_havana     gene
##       ...      ...                 ...    ... .            ...      ...
##    RPL36A        X 101390824-101396155      + | ensembl_havana     gene
##     RPL39        X 119786497-119791662      - | ensembl_havana     gene
##     RPL10        X 154389955-154409168      + | ensembl_havana     gene
##    RPS4Y1        Y     2841602-2932000      + | ensembl_havana     gene
##    RPS4Y2        Y   20756108-20781032      + | ensembl_havana     gene
##               score     phase         gene_id gene_version   gene_name
##           <numeric> <integer>     <character>  <character> <character>
##     RPL22        NA      <NA> ENSG00000116251           12       RPL22
##     RPL11        NA      <NA> ENSG00000142676           15       RPL11
##   RPS6KA1        NA      <NA> ENSG00000117676           15     RPS6KA1
##      RPA2        NA      <NA> ENSG00000117748           11        RPA2
##      RPS8        NA      <NA> ENSG00000142937           13        RPS8
##       ...       ...       ...             ...          ...         ...
##    RPL36A        NA      <NA> ENSG00000241343           11      RPL36A
##     RPL39        NA      <NA> ENSG00000198918            9       RPL39
##     RPL10        NA      <NA> ENSG00000147403           19       RPL10
##    RPS4Y1        NA      <NA> ENSG00000129824           16      RPS4Y1
##    RPS4Y2        NA      <NA> ENSG00000280969            2      RPS4Y2
##              gene_source   gene_biotype transcript_id transcript_version
##              <character>    <character>   <character>        <character>
##     RPL22 ensembl_havana protein_coding          <NA>               <NA>
##     RPL11 ensembl_havana protein_coding          <NA>               <NA>
##   RPS6KA1 ensembl_havana protein_coding          <NA>               <NA>
##      RPA2 ensembl_havana protein_coding          <NA>               <NA>
##      RPS8 ensembl_havana protein_coding          <NA>               <NA>
##       ...            ...            ...           ...                ...
##    RPL36A ensembl_havana protein_coding          <NA>               <NA>
##     RPL39 ensembl_havana protein_coding          <NA>               <NA>
##     RPL10 ensembl_havana protein_coding          <NA>               <NA>
##    RPS4Y1 ensembl_havana protein_coding          <NA>               <NA>
##    RPS4Y2 ensembl_havana protein_coding          <NA>               <NA>
##           transcript_name transcript_source transcript_biotype         tag
##               <character>       <character>        <character> <character>
##     RPL22            <NA>              <NA>               <NA>        <NA>
##     RPL11            <NA>              <NA>               <NA>        <NA>
##   RPS6KA1            <NA>              <NA>               <NA>        <NA>
##      RPA2            <NA>              <NA>               <NA>        <NA>
##      RPS8            <NA>              <NA>               <NA>        <NA>
##       ...             ...               ...                ...         ...
##    RPL36A            <NA>              <NA>               <NA>        <NA>
##     RPL39            <NA>              <NA>               <NA>        <NA>
##     RPL10            <NA>              <NA>               <NA>        <NA>
##    RPS4Y1            <NA>              <NA>               <NA>        <NA>
##    RPS4Y2            <NA>              <NA>               <NA>        <NA>
##           transcript_support_level exon_number     exon_id exon_version
##                        <character> <character> <character>  <character>
##     RPL22                     <NA>        <NA>        <NA>         <NA>
##     RPL11                     <NA>        <NA>        <NA>         <NA>
##   RPS6KA1                     <NA>        <NA>        <NA>         <NA>
##      RPA2                     <NA>        <NA>        <NA>         <NA>
##      RPS8                     <NA>        <NA>        <NA>         <NA>
##       ...                      ...         ...         ...          ...
##    RPL36A                     <NA>        <NA>        <NA>         <NA>
##     RPL39                     <NA>        <NA>        <NA>         <NA>
##     RPL10                     <NA>        <NA>        <NA>         <NA>
##    RPS4Y1                     <NA>        <NA>        <NA>         <NA>
##    RPS4Y2                     <NA>        <NA>        <NA>         <NA>
##            protein_id protein_version     ccds_id
##           <character>     <character> <character>
##     RPL22        <NA>            <NA>        <NA>
##     RPL11        <NA>            <NA>        <NA>
##   RPS6KA1        <NA>            <NA>        <NA>
##      RPA2        <NA>            <NA>        <NA>
##      RPS8        <NA>            <NA>        <NA>
##       ...         ...             ...         ...
##    RPL36A        <NA>            <NA>        <NA>
##     RPL39        <NA>            <NA>        <NA>
##     RPL10        <NA>            <NA>        <NA>
##    RPS4Y1        <NA>            <NA>        <NA>
##    RPS4Y2        <NA>            <NA>        <NA>
##   -------
##   seqinfo: 70 sequences from an unspecified genome; no seqlengths

To ensure multi-omic consistency, the analysis was restricted to the exact same 500 randomly selected cells per patient used in the scRNA-seq workflow. To handle the raw epigenetic data for these specific cells, the scATAC-seq fragment files were pre-processed via system-level commands (gunzip, bgzip, and tabix). Then Tabix indexing was performed, a fundamental step for the analysis epigenetic component, primarily due to the massive and sparse nature of scATAC-seq data. While gene expression (RNA) files are relatively compact, ATAC fragment files can often reach several gigabytes in size, making it impossible to load the entire dataset into the computer’s RAM. Tabix creates a .tbi file that serves as an index or a genomic roadmap. Thanks to this index, the Signac software does not need to read the entire file from beginning to end to perform an analysis; instead, it can move to the exact genomic coordinates of the Ribosomal Protein (RP) genes. Without this block-compressed indexing (bgzip), querying the regions of interest would require prohibitive processing times and would inevitably lead to system crashes due to resource exhaustion.

The data were normalized using TF-IDF (Term Frequency-Inverse Document Frequency) and reduced via SVD (Singular Value Decomposition), a process known as LSI (Latent Semantic Indexing).

p1_ATAC_path <- "../data/ATAC/.tsv.gz/GSM9219736_PDX_OS107_atac_fragments.tsv"
p2_ATAC_path <- "../data/ATAC/.tsv.gz/GSM9219738_PDX_OS526_atac_fragments.tsv"
p3_ATAC_path <- "../data/ATAC/.tsv.gz/GSM9219740_PDX_OS774_atac_fragments.tsv"
p4_ATAC_path <- "../data/ATAC/.tsv.gz/GSM9219742_PDX_OS833_atac_fragments.tsv"
p5_ATAC_path <- "../data/ATAC/.tsv.gz/GSM9219744_PDX_384_atac_fragments.tsv"

# INDICIZZAZIONE
library(Rsamtools)
all_paths <- c(p1_ATAC_path, p2_ATAC_path, p3_ATAC_path, p4_ATAC_path, p5_ATAC_path)

for (path in all_paths) {
  if (!file.exists(paste0(path, ".tbi"))) {
    system(paste("tabix -p bed", path))
  }
}

# Creazione degli oggetti ATAC
p1_OS107_ATAC_ribo <- prepare_pdx_atac(p1_ATAC_path, ribo_granges, colnames(p1), "P1")
p2_OS526_ATAC_ribo <- prepare_pdx_atac(p2_ATAC_path, ribo_granges, colnames(p2), "P2")
p3_OS774_ATAC_ribo <- prepare_pdx_atac(p3_ATAC_path, ribo_granges, colnames(p3), "P3")
p4_OS833_ATAC_ribo <- prepare_pdx_atac(p4_ATAC_path, ribo_granges, colnames(p4), "P4")
p5_384_ATAC_ribo <- prepare_pdx_atac(p5_ATAC_path, ribo_granges, colnames(p5), "P5")

atac_merged_con_clusters <- merge(p1_OS107_ATAC_ribo, y = c(p2_OS526_ATAC_ribo, p3_OS774_ATAC_ribo, p4_OS833_ATAC_ribo, p5_384_ATAC_ribo),
                     add.cell.ids = c("P1", "P2", "P3", "P4", "P5"))

ATAC Clustering

As for scRNA-seq data, again the objective was to determine whether the patient-specific patterns observed in gene expression were also rooted in the chromatin architecture. By focusing on the accessibility of genomic regions corresponding to RP genes, the aim was to identify whether the “translational identity” of each PDX model is already encoded at the level of open chromatin before transcription occurs.

The resulting UMAP projection (Fig. 3) provides a high-resolution map of the epigenetic landscape of the 2,500 subsampled cells.

Several key observations can be made: (1) much like the scRNA-seq results, the cells do not form a single, uniform cluster. Instead, they organize into distinct groups that correlate strongly with the original patient identity. This indicates that chromatin accessibility at ribosomal loci is not a generic “housekeeping” feature but is highly divergent between different tumor models; (2) the clear separation suggests that each PDX model possesses a unique “epigenetic fingerprint.” For instance, if Patient A’s cells are localized far from Patient B’s cells on the UMAP, it implies that the regulatory elements (promoters and enhancers) governing their ribosomal proteins have significantly different levels of openness; (3) the fact that the ATAC-seq data mirrors the RNA-seq clustering confirms the robustness of the RP signature. It proves that the heterogeneity we observe is a biological reality of the PDX models and not a technical artifact of a single sequencing method.

From a biological standpoint, this visualization supports the hypothesis of “tumor-specialized ribosomes.” It suggests that different tumors maintain different “epigenetic programs” for their protein synthesis machinery, potentially to adapt to specific metabolic or oncogenic needs.

atac_obj <- carica_oggetto_atac("../data/ATAC/atac_merged_con_clusters.rds")
atac_obj <- processa_pdx_atac(atac_obj)
Seurat::ElbowPlot(atac_obj, ndims = 15, reduction = "lsi") + 
  ggtitle("3a.   Dimensionality Validation: LSI Elbow Plot")

DimPlot(atac_obj, group.by = "orig.ident") + ggtitle("3b.   Patient Specific ATAC Signature")

Dimensionality reduction was performed using Latent Semantic Indexing (LSI). We selected dimensions 2 through 15 for the downstream UMAP projection and clustering. The first LSI component was excluded as it often captures technical variation related to sequencing depth rather than biological signal. The range of 15 dimensions was chosen as a robust compromise to capture the inter-patient heterogeneity of the PDX models while minimizing stochastic noise.

Epigenetic Landscape Validation: Dimensionality and Resolution Sweeping

To ensure the robustness of the scATAC-seq analysis, a dual-parameter sweeping, using the same concept that was used for scRNA-seq sweeping, has been performed. This step is crucial to verify that the patient-specific clustering is a biological reality and not an artifact of specific parameter choices.

In the first row of Fig. 4a-c, three different ranges of Latent Semantic Indexing (LSI) components have been tested to define the UMAP space, consistently excluding the first dimension to avoid technical bias related to sequencing depth:

  • Under-fitting (2:5 LSI): Using too few dimensions results in a compressed projection (Fig. 4a) where the 5 PDX models are forced together, losing the fine-grained epigenetic differences between patients.

  • Optimal (2:15 LSI): Based on the Elbow Plot, this range (Fig. 4b) provides the clearest separation. The cells organize into distinct clouds corresponding to their patient of origin (orig.ident), confirming that the ribosomal protein (RP) chromatin accessibility signature is highly patient-specific.

  • Noise Inclusion (2:50 LSI): Increasing the dimensions further (Fig. 4c) does not improve separation; instead, it introduces stochastic noise that begins to blur the cluster boundaries, leading to a less defined global structure.

In the second row of Fig. 4d-f, Resolution Sweep is performed, using the optimal dimensionality (2:15 LSI). The resolution parameter tested were 0.1, 0.5 and 1,2, in order to find the most biologically meaningful granularity:

  • Broad Clusters (Res 0.1): This setting is too conservative (Fig. 4d), grouping all cells into a single large cluster and failing to recognize the inherent inter-patient heterogeneity.

  • Optimal (Res 0.5): This resolution (Fig. 4e) perfectly mirrors the experimental design, identifying 5 distinct clusters that map directly to the 5 PDX models. This “gold standard” confirms that the RP signature is the primary driver of epigenetic variation in our dataset.

  • Over-clustering (Res 1.2): High resolution (Fig. 4f) artificially fragments the data into 13 clusters. These sub-groups likely represent technical noise or minor cell-state variations rather than significant biological differences in the ribosomal chromatin program.

stile_titolo <- theme(plot.title = element_text(size = 12, face = "bold", hjust = 0.5))

# SWEEP DELLE DIMENSIONI (LSI)
p_low_dim  <- DimPlot(RunUMAP(atac_obj, reduction = "lsi", dims = 2:5, verbose = FALSE), 
                      group.by = "orig.ident") + 
              ggtitle("4a.   Dimensionality: 2:5 LSI (under-fitting)") + stile_titolo

p_mid_dim  <- DimPlot(RunUMAP(atac_obj, reduction = "lsi", dims = 2:15, verbose = FALSE), 
                      group.by = "orig.ident") + 
              ggtitle("4b.   Dimensionality: 2:15 LSI (optimal)") + stile_titolo

p_high_dim <- DimPlot(RunUMAP(atac_obj, reduction = "lsi", dims = 2:50, verbose = FALSE), 
                      group.by = "orig.ident") + 
              ggtitle("4c.   Dimensionality: 2:50 LSI (noise inclusion)") + stile_titolo


# SWEEP DELLA RISOLUZIONE
res_base_atac <- FindNeighbors(atac_obj, reduction = "lsi", dims = 2:15, verbose = FALSE)

p_res_low  <- DimPlot(FindClusters(res_base_atac, resolution = 0.1, verbose = FALSE), 
                      group.by = "seurat_clusters") + 
              ggtitle("4d.   Resolution: 0.1 (broad clusters)") + stile_titolo

p_res_mid  <- DimPlot(FindClusters(res_base_atac, resolution = 0.5, verbose = FALSE), 
                      group.by = "seurat_clusters") + 
              ggtitle("4e.   Resolution: 0.5 (optimal)") + stile_titolo

p_res_high <- DimPlot(FindClusters(res_base_atac, resolution = 1.2, verbose = FALSE), 
                      group.by = "seurat_clusters") + 
              ggtitle("4f.   Resolution: 1.2 (over-clustering)") + stile_titolo

(p_low_dim | p_mid_dim | p_high_dim) / (p_res_low | p_res_mid | p_res_high)

The integration of these results justifies the final selection of dimensions 2:15 and resolution 0.5. This configuration captures the maximum biological signal with minimum noise, demonstrating that each patient maintains a unique and stable “epigenetic fingerprint” regarding ribosomal protein gene accessibility.

Quantitative Validation: Contingency Table Analysis

To conclude the validation of our scATAC-seq analysis, we generate a contingency table to objectively quantify the alignment between the computational clusters and the biological origin of the cells.

The ultimate goal of testing multiple clustering parameters is to assess whether PDX samples can be separated based on the chromatin accessibility state of ribosomal protein loci. While UMAP visualizations provide a qualitative overview, a contingency table provides the statistical proof of this separation.

This table cross-references the clusters identified by the algorithm (at the optimal resolution of 0.5) with the original patient metadata . A “clean” separation is indicated by a one-to-one (or nearly one-to-one) correspondence, where each cluster is almost exclusively composed of cells from a single patient. This confirms that the ribosomal protein (RP) chromatin signature is a robust, patient-specific feature rather than a result of stochastic noise or technical batch effects.

Validazione Cluster vs Paziente
OS107 OS384 OS526 OS774 OS833
0 343 218 338 274 287
1 56 78 54 115 94
2 55 131 36 27 25
3 15 56 34 79 71
4 31 17 38 5 23

Discussion

During the initial stages of the analysis, the RStudio environment encountered continuous crashes and session restarts, particularly when reading and also during the merging of high-dimensional scRNA-seq and scATAC-seq datasets from multiple patients.

By monitoring the system’s resource allocation in real-time using the system(“free -m”) command, i discovered a critical memory bottleneck, as the process was hitting the default hardware limits of the virtualized environment.

To fix this, i had to manually tell Windows to give more ‘breathing room’ to the Linux system (WSL2) that runs Docker. To resolve the issue i had to go into the main User folder on the computer and created a simple text file named .wslconfig. In this file, i changed the default limits by writing these exact settings:

[wsl2] memory=6GB swap=4GB

After restarting the PC, the system finally had enough memory to handle the large datasets without crashing.

This adjustment was also supported by the use of manually triggered garbage collection (gc()) within the R script, ensuring that the heavy computational tasks—such as fragment object creation and large-scale merging—could be completed successfully without further system failures.

Data Management and Storage Challenges

Furthermore, during the final stages of the project, significant challenges were encountered regarding data storage and synchronization across platforms:

In conclusion, these adjustments in data handling and system configuration were essential to overcome hardware limitations, ensuring a stable and reproducible workflow within both the GitHub and Docker ecosystems.